Using Variational Inference and MapReduce to Scale Topic Modeling
Authors
Abstract
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we propose a technique called MapReduce LDA (Mr. LDA) to accommodate very large corpus collections in the MapReduce framework. In contrast to other techniques that scale LDA inference with Gibbs sampling, we use variational inference. Our solution efficiently distributes computation and is relatively simple to implement. More importantly, this variational implementation, unlike highly tuned and specialized implementations, is easily extensible. We demonstrate two extensions of the model possible with this scalable framework: informed priors to guide topic discovery and modeling topics from a multilingual corpus.
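To illustrate the general idea described in the abstract, the sketch below shows how the variational E-step for LDA can be phrased as a MapReduce job: each mapper runs the per-document variational updates against the current topic-word parameters and emits sufficient statistics that reducers sum. This is a minimal Python/NumPy sketch under assumed names (e_step_map, sum_reduce, lam, alpha, n_iter), not the authors' Mr. LDA code.

# A minimal sketch (not the authors' Mr. LDA implementation) of the
# variational E-step for LDA as a MapReduce job: mappers update the
# per-document variational parameters given the current topic-word
# parameters `lam`; reducers sum the emitted sufficient statistics.
import numpy as np
from scipy.special import digamma

def e_step_map(doc, lam, alpha, n_iter=50):
    """Mapper: doc is a list of (word_id, count) pairs.
    Yields ((topic, word_id), expected_count) sufficient statistics."""
    n_topics, vocab_size = lam.shape
    # Expected log topic-word probabilities under the variational q(beta)
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.ones(n_topics)  # variational Dirichlet for this document
    word_ids = np.array([w for w, _ in doc], dtype=int)
    counts = np.array([c for _, c in doc], dtype=float)
    for _ in range(n_iter):
        # phi[k, n] is proportional to exp(E[log theta_k] + E[log beta_{k, w_n}])
        log_phi = digamma(gamma)[:, None] + Elog_beta[:, word_ids]
        log_phi -= log_phi.max(axis=0)          # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=0)                  # normalize over topics
        gamma = alpha + (phi * counts).sum(axis=1)
    for k in range(n_topics):
        for n, w in enumerate(word_ids):
            yield (k, int(w)), counts[n] * phi[k, n]

def sum_reduce(key, values):
    """Reducer: accumulate the statistics for one (topic, word) cell."""
    return key, sum(values)

In a full variational EM loop, a driver would presumably run this job once per iteration, add the topic-word Dirichlet prior to the reduced counts to obtain the new topic-word parameters, and redistribute them to the mappers for the next pass; this mirrors the distributed structure the abstract describes without reproducing the paper's exact design.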
Similar resources
Tag-Weighted Topic Model For Large-scale Semi-Structured Documents
Massive numbers of Semi-Structured Documents (SSDs) have accumulated during the evolution of the Internet. These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous works focused on modeling the unstructured text, and recently, some other methods have been proposed to model the unstructured text with specific tags. To build a general model for SSDs r...
Models, Inference, and Implementation for Scalable Probabilistic Models of Text
Title of dissertation: Models, Inference, and Implementation for Scalable Probabilistic Models of Text. Ke Zhai, Ph.D., 2014, Dept. of Computer Science. Dissertation directed by: Professor Jordan Boyd-Graber, iSchool, UMIACS. Unsupervised probabilistic Bayesian models are powerful tools for statistical analysis, especially in the area of information retrieval, document analysis and text processing. ...
Private Topic Modeling
We develop a privatised stochastic variational inference method for Latent Dirichlet Allocation (LDA). The iterative nature of stochastic variational inference presents challenges: multiple iterations are required to obtain accurate posterior distributions, yet each iteration increases the amount of noise that must be added to achieve a reasonable degree of privacy. We propose a practical algor...
Topic Models
Figure 4. The analysis of a document from Science: the assignment of words to their most likely topics in the abstract of the article, and the top ten most similar articles. Document similarity was computed using Eq. (4); topic words were computed using Eq. (3). 3. POSTERIOR INFERENCE FOR LDA. The central computational problem for topic modeling with LDA is approximating the...
Distributed Flexible Nonlinear Tensor Factorization for Large Scale Multiway Data Analysis
Tensor factorization is an important approach to multiway data analysis. However, real-world tensor data often encompass complex interactions among tensor elements, and are extremely sparse and of huge size. Despite the success of existing approaches, they are either not powerful enough to model the complex interactions or unable to handle the extreme sparsity of the data. To overcome these limits, we propose a new tens...
Journal: CoRR
Volume: abs/1107.3765
Issue: -
Pages: -
Publication year: 2011